Assessment of performance characteristics of COVID-19 ICD-10-CM diagnosis code U07.1 using SARS-CoV-2 nucleic acid amplification test results

The Food and Drug Administration’s Biologics Effectiveness and Safety Initiative conducts active surveillance to protect public health during the coronavirus disease 2019 (COVID-19) pandemic. This study evaluated performance of International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM) diagnosis code U07.1 in identifying COVID-19 cases in claims compared with severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) nucleic acid amplification test results in linked electronic health records (EHRs). Care episodes in three populations were defined using COVID-19-related diagnoses (population 1), SARS-CoV-2 nucleic acid amplification test procedures (population 2), and all-cause hospitalizations (population 3) in two linked claims-EHR databases: IBM® MarketScan® Explorys® Claims-EMR Data Set (commercial) and OneFlorida Data Trust linked Medicaid-EHR. Positive and negative predictive values were calculated. Respectively, populations 1, 2, and 3 included 26,686, 26,095, and 2,564 episodes (commercial) and 29,117, 23,412, and 9,629 episodes (Florida Medicaid). The positive predictive value was >80% and the negative predictive value was >95% in each population, with the highest positive predictive value in population 3 (commercial: 91.9%; Medicaid: 93.1%). Findings did not vary substantially by patient age. Positive predictive values in populations 1 and 2 fluctuated during April–June 2020. They then stabilized in the commercial but not the Medicaid population. Negative predictive values were consistent over time in all populations and databases. Our findings indicate that U07.1 has high performance in identifying COVID-19 cases and noncases in claims databases. Performance may vary across populations and periods.


Introduction
Coronavirus disease 2019 , caused by the novel strain of coronavirus called severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has significantly affected the United States and countries throughout the world during the ongoing pandemic. A new diagnosis code for COVID-19 infections, U07.1, was introduced in the International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM) on April 1, 2020 [1,2]. ICD-10-CM coding guidelines specify that U07.1 should be used for a confirmed diagnosis based on a positive lab test or a provider's documentation. In most real-world healthcare data, such as administrative claims, access to a provider's notes and documentation is limited. Claims data enriched with linked electronic health records (EHRs), such as the data used in this study, include lab test results and thus allow for the assessment of the performance characteristics of the U07.1 code.
The Biologics Effectiveness and Safety (BEST) Initiative is an active surveillance program in the Food and Drug Administration's Center for Biologics Evaluation and Research (CBER). This program contributes to CBER's mission to evaluate and ensure the safety and effectiveness of biologic products, including vaccines. Understanding the performance characteristics of code U07.1 in administrative claims data sources to distinguish COVID-19 cases and noncases is important for conducting COVID-19 studies using such data, including postauthorization or postapproval evaluation of COVID-19 vaccine effectiveness.
Prior studies have evaluated the U07.1 code's performance in identifying COVID-19 cases in hospital discharge data [3] but not in outpatient settings or administrative claims data. This study evaluated the performance characteristics of the ICD-10-CM code U07.1 in all settings among three care-seeking populations identified in the claims data. We used two linked claims-EHR data sources, consisting of commercially insured and Medicaid-insured individuals, respectively. In the claims portion of the linked data, we identified care-seekers and then classified them into COVID-19 cases and noncases on the basis of the presence or absence of the U07.1 code, respectively. The COVID-19 case status was verified by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) nucleic acid amplification test (NAAT) results in the EHR portion of the linked data. Within each study population, the positive predictive value (PPV) and negative predictive value (NPV) of U07.1 were estimated. PPV and NPV were also estimated monthly to evaluate temporal trends of the performance characteristics. These results will inform future observational studies in which the U07.1 code is used to identify COVID-19 cases and noncases in administrative claims data.

Data source
For the main analyses, we used two linked claims-EHR databases: IBM1 MarketScan1 Exp-lorys1 Claims-EMR (Electronic Medical Record) Data Set (CED) and OneFlorida Data Trust linked Medicaid-EHR. Because these databases contain deidentified data and are fully compliant with U.S. privacy laws and regulations (i.e., the Health Insurance Portability and Accountability Act), this study was exempt from institutional review board approval.
CED consists of deterministically linked claims (from the IBM MarketScan Commercial Database) and EHRs (from the IBM Explorys EHR Database) of the same individuals. The MarketScan Commercial Database contains data of commercially insured individuals from a selection of large employers, health plans, and government and public organizations. The Explorys EHR Database collects EHRs from more than 30 healthcare systems, spanning academic centers and community practices.
The OneFlorida Data Trust linked Medicaid-EHR contains claims data for Floridians enrolled in Medicaid and EHRs from public and private healthcare systems. Individuals are deterministically linked. EHRs came from integrated healthcare delivery networks, 13 large hospitals, and ambulatory care and primary care facilities.
To assess selection bias, we used the full MarketScan Commercial Database and OneFlorida Data Trust, the source claims databases from which the linked claims-EHR databases were created.

Study period
The study period was April 1-December 31, 2020, for the commercially insured populations and April 1-November 30, 2020, for the Florida Medicaid populations.

Study populations
Identify care episodes in claims portion of the claims-EHR linked data. We defined three populations of care episodes based on healthcare encounters in the claims portion of the linked data. A care episode identifies a cluster of events likely reflecting the continuum of care for an individual. One individual could have multiple care episodes. Population 1 consisted of episodes with a diagnosis of COVID-19, COVID-19-related symptoms, or suspected COVID-19 exposure (S1 Table 1 in S1 File). This list of codes includes ICD-10-CM diagnoses corresponding to the Centers for Disease Control and Prevention COVID-like illness definition (cough, fever, shortness of breath) [4] and additional codes that are potentially related to COVID-19 based on consultation with our physician consultants. The additional codes included COVID-19 diagnosis (U07.1; to capture cases without reported symptoms), other relevant conditions (e.g., respiratory failure) and symptoms (e.g., loss of smell and taste), and potential COVID-19 exposure. Population 2 consisted of episodes with a SARS-CoV-2 NAAT procedure code (S1 Table 2 in S1 File). Population 3 consisted of all-cause hospitalizations. Table 1 presents the criteria for grouping claims into episodes, including assignment of the index event (population 1: first encounter with a diagnosis of COVID-19, COVID-19-related symptoms, or suspected COVID-19 exposure; population 2: first NAAT procedure code; population 3: first all-cause hospitalization) and episode start and end dates (i.e., a date before the index date and after the last relevant encounter, respectively, to define a period during which NAAT results on the linked EHR are likely relevant). Only individuals aged <65 years at the index event with continuous coverage of medical benefits (allowing for a coverage gap up to 31 days) during the episodes and 6 months prior to the start of the episodes (baseline period) were retained.
Identify NAAT results in EHR portion of the claims-EHR linked data. Care episodes identified from the prior step were examined for SARS-CoV-2 NAAT results in the linked EHRs. Logical Observation Identifiers Names and Codes (LOINC) (S1 Table 3 in S1 File) were used to identify the SARS-CoV-2 NAAT results, and the observation date associated with the LOINC was the test result date. Episodes with at least one NAAT result during the episode were selected for the final three study populations (Table 1).

U07.1 diagnosis status classification
We classified the study populations into U07.1-positive and U07.1-negative episodes based on the code's presence or absence in claims during the episodes. Diagnoses from all care settings were used to classify study populations 1 and 2. Inpatient discharge diagnoses were used to classify study population 3.

Performance characteristics
The PPV was the proportion of U07.1-positive episodes confirmed by a positive SARS-CoV-2 NAAT result in EHRs during the episode (when multiple NAAT results were present, one positive result was sufficient to confirm a positive case). The NPV was the proportion of U07.1-negative episodes confirmed by negative SARS-CoV-2 NAAT results in EHRs (when multiple NAAT results were present, all must be negative). PPV and NPV were calculated in each study population and stratified by calendar month, index care setting (inpatient, emergency department [ED], outpatient, and other settings-defined as index event occurring at an inpatient, ED, or outpatient setting or at a setting that is none of the above, respectively), and patient age at the index event. We calculated confidence intervals (CIs) for the PPVs and NPVs using the Agresti-Coull interval [5].

Assessment of selection bias
Using the same methods that derived the study populations, we identified corresponding source populations in the full claims databases. We compared the source populations with the study populations for the distribution of patient demographics (e.g., age), clinical characteristics (e.g., Deyo-Charlson Comorbidity Index [CCI] [6, 7]), and healthcare utilization (e.g., ED visits). An absolute standardized mean difference (SMD) >0.20 was used as the threshold to denote potentially meaningful differences between the study and source populations.

Construct episodes
Events were grouped into an episode when diagnosis dates are within 14 days of each other. An episode ends when there is a gap of >14 days between a diagnosis and any subsequent ones. When an episode extended into a hospitalization, the episode extended from the admission date to the discharge date.
Tests performed within 21 days of each other were grouped together as an episode. An episode ends when there is a gap of more than 21 days between a test and any subsequent ones.
Consecutive hospitalizations with a subsequent admission date within 1 day of the prior discharge date were connected to construct hospitalization episodes.

Define index event
First encounter with a relevant symptom or diagnosis in an episode Individuals must be <65 years of age on the index event date.

Require continuous enrollment
Individuals must be continuously enrolled with medical benefits during the episode and during the 6-month baseline period prior to the episode. An enrollment gap of 31+ days was allowed.

Final study populations
Limit to episodes with at least one observed positive or negative SARS-CoV-2 NAAT result in the linked EHRs during the episodes.

PLOS ONE
Validating U07.1 COVID-19 diagnosis code in claims   In both the commercially insured and Florida Medicaid populations, the distribution of NAATs as identified by LOINC codes was similar among study populations 1, 2, and 3 (S2 Table 1A and 1B in S2 File). The most common LOINC used in both databases was 94500-6 (SARS coronavirus 2 RNA [Presence] in Respiratory specimen by NAA with probe detection).

Performance characteristics
The PPV of the COVID-19 diagnosis code U07.1 was above 80%, with narrow 95% CIs in each of the three study populations for the commercially insured and Florida Medicaid data In both databases, the PPVs stratified by patient age and index care setting generally had narrow 95% CIs, except for the PPV from the inpatient subset of population 2 due to a small sample size. PPVs did not vary greatly by patient age in all three populations. However, PPVs showed variations by index care setting. In population 1, the PPVs were lower among episodes in an inpatient index care setting than those with an ED or outpatient setting for both the com- . After limiting these episodes to those with a U07.1 diagnosis from the inpatient setting, PPVs increased to 89.7% (commercial) and 92.6% (Medicaid). In population 2, episodes with an inpatient index care setting were rare. The PPVs were similar between noninpatient settings among the commercially insured but were highest for the ED setting (87.7%) followed by outpatient (77.7%) and other (59.9%) settings among the Florida Medicaid population.
The NPV was high across all three study populations in both data sources with narrow 95% CIs. Populations 1 and 2 had similar NPVs (commercial: 96.8% and 95.6%, respectively; Medicaid: 98.1% and 98.4%, respectively). The NPV in population 3 was the highest (commercial: 99.5%; Medicaid: 99.7%). In both the commercial and Florida Medicaid data, the NPV did not vary greatly by patient age. By index care setting (populations 1 and 2), NPVs had similar

PLOS ONE
Validating U07.1 COVID-19 diagnosis code in claims patterns between the commercial and Florida Medicaid data: NPVs were high (>92%) across all index care settings, with the lowest values observed for episodes with an index care setting in the other category. Stratified NPVs also had narrow 95% CIs, except for the NPV of the inpatient subset of population 2 due to a small sample size.  S2 Table 3A and 3B in S2 File for underlying data). The 95% CIs were wide for the monthly PPVs but narrow for the NPVs, primarily due to the larger size of the monthly U07.1-negative episodes than the U07.1-positive episodes. For the commercially insured, PPVs in populations 1 and 2 showed a marked decrease early in the pandemic, around May-June 2020 (58.3-75.3%) but increased in July-December (population 1: 84.2-92.4%; population 2: 89.7-95.2%). PPVs were more stable in population 3 during April-December (ranging from 87.5 to 100.0%) despite the wide 95% CIs. In Florida Medicaid, PPVs fluctuated in populations 1 and 2 and, to a lesser extent, in population 3. In all populations, NPVs were consistently >90% over time.

Assessment of selection bias
S2 Table 4A (commercially insured) and 4b (Florida Medicaid) in S2 File compare the source populations and the corresponding study populations to assess the potential selection bias introduced from subsetting source populations to the study populations with a linked EHR and a SARS-CoV-2 NAAT result.
Among both the commercially insured and Florida Medicaid enrollees, study populations generally resembled the source populations for most characteristics evaluated, with a few differences noted below.
Among Florida Medicaid, compared with the corresponding source populations, mean baseline CCI was higher in study population 1 (0.8 vs. 0.5) and study population 2 (0.7 vs. 0.5) but similar in study population 3. Study population 1 was more likely to have a hospitalization during the baseline period than the corresponding source population (18.4% vs. 10.7%). However, all three study populations had similar clinical profiles for symptoms and potential COVID-19 complications as the source populations. Compared with the source population, study population 1 was more likely to have an inpatient index event (16.7% vs. 4.9%) and study population 2 was more likely to have an outpatient index event (56.7% vs. 37.1%). Study population 1 was more likely to have a COVID-19 diagnosis than the source population (10.2% vs. 4.1%). Like the commercially insured, Florida Medicaid episodes with suspected COVID-19 exposure diagnoses were overrepresented in study populations 1 and 3. Overall, the study populations were generally representative of the source populations except for several characteristics noted above, which may potentially have introduced some selection bias.

Discussion
Evidence suggests the ICD-10-CM U07.1 diagnosis code was broadly and rapidly adopted to identify COVID-19 patients in different care settings in the United States following its release on April 1, 2020 [8], but there is scant evidence regarding the performance characteristics of the diagnosis code in different care settings. We estimated the PPVs and NPVs of the COVID-19 diagnosis code U07.1 for three patient groups in the claims portion of two linked claims-EHR databases consisting of commercially insured and Medicaid-insured individuals. SARS-CoV-2 NAAT results from the EHR portion of the databases were used as the reference method. In each study population, the overall PPVs (>80%) and NPVs (>95%) were high and did not vary substantially by patient age but varied by healthcare setting.
For both the commercial insurance and Medicaid databases, PPVs and NPVs were similar in hospitalized individuals (population 3). They were also the highest in population 3, compared with the two other populations that were based on diagnosis of COVID-19, symptoms, or potential exposure (population 1) or NAAT procedure codes (population 2). PPVs and NPVs estimated in population 3 were consistent with other reported results for hospitalized patients: PPV: 91.3%, NPV: 99.8% (Kadri et al. [3]); PPV: 86% (Brown [9]); and PPV: 99.5%, NPV: 99.7% (Blatz et al. [10], which was among children). Our results and existing research indicate that the performance of the U07.1 code is higher in the inpatient setting. The PPVs and NPVs in populations 1 and 2 were similar and high for both the commercial insurance and Medicaid databases, although the PPV and NPV were both lower than those in population 3. The U07.1 code performance was overall poorer in Medicaid versus commercial insurance claims data.
For populations 1 and 2, PPVs were higher among populations with commercial insurance than those with Florida Medicaid. This may reflect that coding accuracy of COVID-19 is lower among Florida Medicaid than commercial insurance during the study period. However, many factors could contribute to this observed difference, such as different data processing systems or Florida's local disease prevalence versus disease prevalence among the multiple geographic regions represented in the commercial insurance data. PPVs varied by month, with greater variations in populations 1 and 2 and early in the pandemic. This instability lasted longer in the Medicaid data. Fluctuation of PPVs may be related to disease prevalence and coding practice changes. NPVs appeared stable in the study period for all populations evaluated.
Our study has several strengths. We used two large-scale linked claims-EHR databases representing commercially and publicly insured individuals. The use of linked claims-EHR databases to validate claims-based diagnoses is more efficient and cost-effective than the conventional medical chart review process. Additionally, NAAT results recorded on linked EHRs are the gold standard to ascertain COVID-19 case status, and as part of the structured EHR components, test results allow case adjudication to be fully automated. The BEST network brought together a broad range of real-world data, including linked claims-EHR data used in this study, which facilitated rapid assessment of the performance characteristics of the ICD-10-CM diagnosis code in the current public health crisis. We evaluated three care-seeking populations from which U07.1-positive and U07.1-negative cases were identified. The careseeking U07.1-negative episodes allowed the estimation of NPVs and identified COVID-19 noncases that are comparable to the cases in terms of the type of care-seeking encounters. The study results showed the performance characteristics (PPV and NPV) of U07.1 were high in all three populations, which were selected based on different types of healthcare encounters with respect to COVID-19. Thus, the results may represent diverse care-seeking populations. Outpatient encounters include all outpatient services because specific places of service (e.g., office, hospital outpatient) cannot be identified in the data. Care setting was defined using a hierarchy if services at multiple settings were found on the index event date: inpatient, ED, outpatient, other. See S2 Table 2B in S2 File for the full results with exact CIs. The PPVs for study population 1 (inpatient setting at index event) and population 3 (all-cause hospitalization) are different because study population 1 episodes include diagnoses outside the hospitalization. CI confidence interval, ED emergency department, ICD- 10 Our study has some limitations. A relatively small proportion of individuals in the administrative claims databases also had linked EHR data and were selected into the study populations. Additionally, EHR data came only from healthcare providers participating in the EHR network and thus may not include all the healthcare services received by the individuals in other healthcare facilities. Although the study populations were generally representative of the source populations for characteristics evaluated, unobserved characteristics may differ. This

PLOS ONE
Validating U07.1 COVID-19 diagnosis code in claims study used data from the pandemic's early stages, and findings may not be generalizable to subsequent periods of the pandemic. The portability of performance characteristics of the U07.1 reported in this study depends on the prevalence of the disease in each population and characteristics of the databases.

Conclusions
The BEST program allowed rapid validation of the COVID-19 ICD-10-CM diagnosis code using linked claims-EHR databases. When compared with NAAT results from EHRs, the overall PPV and NPV of the claims-based U07.1 code were high across all three study populations and did not vary significantly by age but varied by healthcare setting. In populations 1 and 2, PPVs varied by month and were lower for the Medicaid than the commercially insured population. NPVs were consistent over time in all populations and databases. This study demonstrates that the performance of the ICD-10-CM COVID-19 diagnosis code U07.1 in administrative claims databases is strong and that the code has the potential to be used in COVID-19 observational studies, including those of vaccine safety and effectiveness. This study, as a use case, demonstrated the utility of linked claims-EHR databases as well. Future efforts to link claims and EHR databases across healthcare systems can serve to create powerful tools to evaluate the safety and effectiveness of medical products in real-world settings and to contribute to public health.
Supporting information S1 File. Clinical codes used to identify study populations. S1 Table 1 Codes used to identify study population 1 (diagnosis of COVID-19, symptoms, or potential exposure). S1 Table 2 Codes used to identify study population 2 (SARS-CoV-2 NAAT procedure in claims). S1 Table 3 Codes used to identify SARS-CoV-2 nucleic acid amplification test results in the linked electronic health records for study populations 1, 2, and 3. (DOCX) S2 File. Additional results tables. S2 Table 1A Distribution of SARS-CoV-2 NAAT nucleic acid amplification test, in linked EHR data, by study population, among the commercially insured individuals in CED. S2  Fig 4) Performance characteristics (PPV and NPV) of ICD-10-CM diagnosis code U07.1 in administrative claims data, by study population, care setting, and age at the index event, among the commercially insured individuals in CED. S2 Table 2B (corresponding to Fig 5) Performance characteristics of ICD-10-CM diagnosis code U07.1 in administrative claims data, by study population, care setting, and age at the index event, among individuals with Florida Medicaid in OneFlorida linked data. S2 Table 3A (corresponding to Fig 6) Monthly trends in performance characteristics (PPV/NPV) of diagnosis code U07.1 in claims data for commercially insured individuals in CED. S2 Table 3B (corresponding to Fig 6) Monthly trends in performance characteristics (PPV/NPV) of diagnosis code U07.1 in claims data for individuals with Florida Medicaid in OneFlorida linked data. S2 Table 4A Characteristics of study populations compared with the corresponding source populations, among individuals with commercial insurance. S2 Table 4B Characteristics of study populations compared with the corresponding source populations, among individuals with Florida Medicaid. (DOCX)